About TMDB Website:
TMDB is a community-built movie and TV database, specialised around movie, TV and actor data, the wesite currently have 4.4 millions (4,419,249) titles, 810,747 movies, and 139,637 tv shows in its database, as well as 83 million registered users. 1
The Movie Database (TMDB)
This Dataset is from Kaggle website, it contains 21 variables for 10866 movies (10571 unique movies), for 55 years, beginning in 1960 and ending in 2015.
Thousands of actors and actresses are listed, as well as 2399 unique director names.
Additionally, the data includes vote_average, which indicates the average ratings the movie received, the length of the movie in minutes, the release date of the movie, as well as its budget and profit.
Data Sources:
Original data on Kaggle
# Packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline
# Load the data and print out a few lines using head() method.
PATH_URL = "https://d17h27t6h515a5.cloudfront.net/topher/2017/October/59dd1c4c_tmdb-movies/tmdb-movies.csv"
df_url = pd.read_csv (PATH_URL)
df_url.head(3) #print the first 3 records
| id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Twenty-two years after the events of Jurassic ... | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
| 1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | ... | An apocalyptic story set in the furthest reach... | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 5/13/15 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 |
| 2 | 262500 | tt2908446 | 13.112507 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | http://www.thedivergentseries.movie/#insurgent | Robert Schwentke | One Choice Can Destroy You | ... | Beatrice Prior must confront her inner demons ... | 119 | Adventure|Science Fiction|Thriller | Summit Entertainment|Mandeville Films|Red Wago... | 3/18/15 | 2480 | 6.3 | 2015 | 1.012000e+08 | 2.716190e+08 |
3 rows × 21 columns
print('Shape of the data is ',df_url.shape) # return a tuple with the lengths of the corresponding array dimensions.
Shape of the data is (10866, 21)
Note:
This dataset has 10866 Observations, and 21 Variables.
df_url.columns #print all the indicators
Index(['id', 'imdb_id', 'popularity', 'budget', 'revenue', 'original_title',
'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview',
'runtime', 'genres', 'production_companies', 'release_date',
'vote_count', 'vote_average', 'release_year', 'budget_adj',
'revenue_adj'],
dtype='object')
# Perform operations to inspect data types and look for instances of missing
# or possibly errant data.
df_url.info() # prints concise summary of the dataset
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10866 entries, 0 to 10865 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 10866 non-null int64 1 imdb_id 10856 non-null object 2 popularity 10866 non-null float64 3 budget 10866 non-null int64 4 revenue 10866 non-null int64 5 original_title 10866 non-null object 6 cast 10790 non-null object 7 homepage 2936 non-null object 8 director 10822 non-null object 9 tagline 8042 non-null object 10 keywords 9373 non-null object 11 overview 10862 non-null object 12 runtime 10866 non-null int64 13 genres 10843 non-null object 14 production_companies 9836 non-null object 15 release_date 10866 non-null object 16 vote_count 10866 non-null int64 17 vote_average 10866 non-null float64 18 release_year 10866 non-null int64 19 budget_adj 10866 non-null float64 20 revenue_adj 10866 non-null float64 dtypes: float64(4), int64(6), object(11) memory usage: 1.7+ MB
Note:
From the data types we see that all features are object/int64/float64. we can see that release_date is an object type and we will try parsing the date using pandas.to_datetime function so it will be more useful.
# check how many NaN values and in which columns and converted to Ratio
percent_missing = round(df_url.isnull().sum() * 100 / len(df_url),2)
print("Variables Missing Ratio:")
print(percent_missing)
Variables Missing Ratio: id 0.00 imdb_id 0.09 popularity 0.00 budget 0.00 revenue 0.00 original_title 0.00 cast 0.70 homepage 72.98 director 0.40 tagline 25.99 keywords 13.74 overview 0.04 runtime 0.00 genres 0.21 production_companies 9.48 release_date 0.00 vote_count 0.00 vote_average 0.00 release_year 0.00 budget_adj 0.00 revenue_adj 0.00 dtype: float64
Note:
df_url.describe(exclude=[object]) # descriptive statistics excluding object columns
| id | popularity | budget | revenue | runtime | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 10866.000000 | 10866.000000 | 1.086600e+04 | 1.086600e+04 | 10866.000000 | 10866.000000 | 10866.000000 | 10866.000000 | 1.086600e+04 | 1.086600e+04 |
| mean | 66064.177434 | 0.646441 | 1.462570e+07 | 3.982332e+07 | 102.070863 | 217.389748 | 5.974922 | 2001.322658 | 1.755104e+07 | 5.136436e+07 |
| std | 92130.136561 | 1.000185 | 3.091321e+07 | 1.170035e+08 | 31.381405 | 575.619058 | 0.935142 | 12.812941 | 3.430616e+07 | 1.446325e+08 |
| min | 5.000000 | 0.000065 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 10.000000 | 1.500000 | 1960.000000 | 0.000000e+00 | 0.000000e+00 |
| 25% | 10596.250000 | 0.207583 | 0.000000e+00 | 0.000000e+00 | 90.000000 | 17.000000 | 5.400000 | 1995.000000 | 0.000000e+00 | 0.000000e+00 |
| 50% | 20669.000000 | 0.383856 | 0.000000e+00 | 0.000000e+00 | 99.000000 | 38.000000 | 6.000000 | 2006.000000 | 0.000000e+00 | 0.000000e+00 |
| 75% | 75610.000000 | 0.713817 | 1.500000e+07 | 2.400000e+07 | 111.000000 | 145.750000 | 6.600000 | 2011.000000 | 2.085325e+07 | 3.369710e+07 |
| max | 417859.000000 | 32.985763 | 4.250000e+08 | 2.781506e+09 | 900.000000 | 9767.000000 | 9.200000 | 2015.000000 | 4.250000e+08 | 2.827124e+09 |
df_url.describe(exclude=[np.number]) # descriptive statistics excluding numeric columns
| imdb_id | original_title | cast | homepage | director | tagline | keywords | overview | genres | production_companies | release_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10856 | 10866 | 10790 | 2936 | 10822 | 8042 | 9373 | 10862 | 10843 | 9836 | 10866 |
| unique | 10855 | 10571 | 10719 | 2896 | 5067 | 7997 | 8804 | 10847 | 2039 | 7445 | 5909 |
| top | tt0411951 | Hamlet | Louis C.K. | http://www.thehungergames.movie/ | Woody Allen | Based on a true story. | woman director | No overview found. | Comedy | Paramount Pictures | 1/1/09 |
| freq | 2 | 4 | 6 | 4 | 45 | 5 | 134 | 13 | 712 | 156 | 28 |
df_url[df_url.duplicated()] # find the duplicate rows, which will be removed
| id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2090 | 42194 | tt0411951 | 0.59643 | 30000000 | 967000 | TEKKEN | Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian... | NaN | Dwight H. Little | Survival is no game | ... | In the year of 2039, after World Wars destroy ... | 92 | Crime|Drama|Action|Thriller|Science Fiction | Namco|Light Song Films | 3/20/10 | 110 | 5.0 | 2010 | 30000000.0 | 967000.0 |
1 rows × 21 columns
Note:
We observe that there is one duplicate row in our dataset, which will be removed.
df_url["production_companies"].value_counts() # counts of unique values
Paramount Pictures 156
Universal Pictures 133
Warner Bros. 84
Walt Disney Pictures 76
Columbia Pictures 72
...
Silverwood Films 1
Lunch Box Entertainment 1
Atitude Produções e Empreendimentos 1
Mulmur Feed Company 1
Norm-Iris 1
Name: production_companies, Length: 7445, dtype: int64
df_url['vote_average'].unique() # print unique values
array([6.5, 7.1, 6.3, 7.5, 7.3, 7.2, 5.8, 7.6, 8. , 6.2, 5.2, 7.4, 6.1,
7. , 6.8, 5.3, 7.8, 6.4, 6.6, 7.7, 5.6, 6.9, 5.9, 6.7, 5.5, 5. ,
4.4, 5.4, 5.1, 4.8, 5.7, 4.1, 3.9, 4.5, 6. , 4.2, 3.6, 4.3, 4.9,
4.7, 4. , 3.5, 3.8, 3.3, 3.7, 4.6, 7.9, 8.2, 2.6, 3.1, 8.9, 3.2,
2.4, 8.4, 3. , 2.8, 3.4, 8.8, 8.1, 8.3, 2.7, 2.5, 2.1, 8.6, 2.9,
8.5, 9.2, 2.2, 2. , 8.7, 2.3, 1.5])
# 1st step : Drop unwanted cloumns / aren't relevant to our questions
df_url = df_url.drop(['homepage', 'tagline', 'overview'],axis = 1)
#print(df_url)
# 2nd step : Replaced null values in production_companies, cast, director with Unknown keywords
df_url = df_url.fillna({'production_companies':'Unknown', 'cast': 'Unknown', 'director': 'Unknown'})
#print(df_url)
# 3rd step : Drop duplicate row in the dataset
df_url.drop_duplicates(inplace=True)
df_url.reset_index(drop=True, inplace=True)
df_url.shape
#print(df_url)
(10865, 18)
df_url.isnull().sum() #check if the three steps above have worked
id 0 imdb_id 10 popularity 0 budget 0 revenue 0 original_title 0 cast 0 director 0 keywords 1493 runtime 0 genres 23 production_companies 0 release_date 0 vote_count 0 vote_average 0 release_year 0 budget_adj 0 revenue_adj 0 dtype: int64
# 4th step : Fill in imdb_id null records with random numbers generator
import random
import string
import numpy as np
def imdb_id_generator(): # function that returns 7 random numbers with "tt" before it
randomString = ''.join(random.choice(string.digits) for i in range(7)) #Determine the Length of the random string as 7
imdb_id_value = 'tt' + randomString # add "tt" before the numbers
print(imdb_id_value ) # print the values
return imdb_id_value # return the values
# Go through imdb_id column and check if a record is null, if so fill it with the new generated id number
df_url['imdb_id'] = df_url.apply(lambda v: imdb_id_generator() if pd.isnull(v['imdb_id']) else v['imdb_id'], axis=1)
tt3966309 tt1493971 tt5252547 tt6229659 tt1262669 tt9481106 tt3889494 tt6544259 tt6320783 tt7031313
duplicate = df_url[df_url.duplicated('imdb_id')] # Finally we will check if there's any duplicate records after generating the numbers above
duplicate # print the records
| id | imdb_id | popularity | budget | revenue | original_title | cast | director | keywords | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|
Note:
We can see there's no duplicated records.
# 5th step : Drop null records in keywords, genres columns
df_url = df_url.dropna(subset = ['keywords', 'genres']) # Drop all rows with NaNs in keywords and genres
df_url.isnull().sum()
id 0 imdb_id 0 popularity 0 budget 0 revenue 0 original_title 0 cast 0 director 0 keywords 0 runtime 0 genres 0 production_companies 0 release_date 0 vote_count 0 vote_average 0 release_year 0 budget_adj 0 revenue_adj 0 dtype: int64
Note:
As the output shown, we get rid of / replaced all the missing values.
Now the data is clean and ready to use.
# 6th step - part(1) : change relase_date type to datetime to be useful
df_url['release_date'] = pd.to_datetime(df_url['release_date'])
df_url['release_date'].dtype
#type(df_url['release_date'])
dtype('<M8[ns]')
# 6th step - part(2) : Generate (month,day) columns from release_date column
df_url['release_day'] = (df_url['release_date']).dt.day
df_url['release_month'] = (df_url['release_date']).dt.month
df_url.head(3)
| id | imdb_id | popularity | budget | revenue | original_title | cast | director | keywords | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | release_day | release_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | Colin Trevorrow | monster|dna|tyrannosaurus rex|velociraptor|island | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 2015-06-09 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 | 9 | 6 |
| 1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | George Miller | future|chase|post-apocalyptic|dystopia|australia | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 2015-05-13 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 | 13 | 5 |
| 2 | 262500 | tt2908446 | 13.112507 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | Robert Schwentke | based on novel|revolution|dystopia|sequel|dyst... | 119 | Adventure|Science Fiction|Thriller | Summit Entertainment|Mandeville Films|Red Wago... | 2015-03-18 | 2480 | 6.3 | 2015 | 1.012000e+08 | 2.716190e+08 | 18 | 3 |
# - - - - Question(1) - 1st step: Create two dataframes for movies financial performance both grouped by years:
budget_year = df_url.groupby('release_year')['budget'].sum().reset_index(name = "budget")
revenue_year = df_url.groupby('release_year')['revenue'].sum().reset_index(name = "revenue")
# Change "budget_year" and "revenue_year" from object to dataframes type:
budget_year = pd.DataFrame(budget_year)
revenue_year = pd.DataFrame(revenue_year)
# Merge the twondataframes to one:
financial_performance_df = budget_year.merge(revenue_year)
financial_performance_df # print the new dataframe
| release_year | budget | revenue | |
|---|---|---|---|
| 0 | 1960 | 19056948 | 137905000 |
| 1 | 1961 | 46137000 | 337720188 |
| 2 | 1962 | 54722126 | 215579846 |
| 3 | 1963 | 73265000 | 187404989 |
| 4 | 1964 | 39483161 | 340981782 |
| 5 | 1965 | 70205115 | 458081854 |
| 6 | 1966 | 57554800 | 84736689 |
| 7 | 1967 | 100652200 | 737834637 |
| 8 | 1968 | 71939000 | 264732980 |
| 9 | 1969 | 41954087 | 243957076 |
| 10 | 1970 | 126961946 | 560221969 |
| 11 | 1971 | 75997000 | 404910610 |
| 12 | 1972 | 36279254 | 494730171 |
| 13 | 1973 | 65190783 | 1223981102 |
| 14 | 1974 | 76970000 | 812539818 |
| 15 | 1975 | 56279000 | 957489966 |
| 16 | 1976 | 122150000 | 801005600 |
| 17 | 1977 | 161580000 | 2180583159 |
| 18 | 1978 | 208997000 | 1353386648 |
| 19 | 1979 | 254814000 | 1684794913 |
| 20 | 1980 | 362500000 | 1768662387 |
| 21 | 1981 | 368612000 | 1754700877 |
| 22 | 1982 | 437795002 | 2444443852 |
| 23 | 1983 | 519107412 | 2297358384 |
| 24 | 1984 | 711211964 | 2537395361 |
| 25 | 1985 | 717520637 | 2825489230 |
| 26 | 1986 | 684033613 | 2956898540 |
| 27 | 1987 | 707655716 | 3386522116 |
| 28 | 1988 | 898923000 | 3725222566 |
| 29 | 1989 | 1067656360 | 5128933486 |
| 30 | 1990 | 1289922066 | 5298128773 |
| 31 | 1991 | 1453433000 | 4684739920 |
| 32 | 1992 | 1425325538 | 5999868833 |
| 33 | 1993 | 1748628653 | 6949429765 |
| 34 | 1994 | 2185807032 | 7086791105 |
| 35 | 1995 | 2819384377 | 9152179144 |
| 36 | 1996 | 3600042051 | 8278301474 |
| 37 | 1997 | 4681086675 | 10606757132 |
| 38 | 1998 | 4473560000 | 9492543717 |
| 39 | 1999 | 5624775106 | 11252268548 |
| 40 | 2000 | 5517200000 | 10724897412 |
| 41 | 2001 | 5562700000 | 13359237296 |
| 42 | 2002 | 5635640255 | 14102793054 |
| 43 | 2003 | 6114488167 | 15062817728 |
| 44 | 2004 | 7064140208 | 16704911389 |
| 45 | 2005 | 7194212349 | 16440981223 |
| 46 | 2006 | 7146782800 | 16156482955 |
| 47 | 2007 | 7289319004 | 19057469082 |
| 48 | 2008 | 7424116809 | 18637699537 |
| 49 | 2009 | 8223334056 | 22028168149 |
| 50 | 2010 | 8716734449 | 21359123379 |
| 51 | 2011 | 8717847976 | 23334587180 |
| 52 | 2012 | 8010784010 | 24384579732 |
| 53 | 2013 | 8739303605 | 24032354172 |
| 54 | 2014 | 7732194045 | 24024972183 |
| 55 | 2015 | 7445797557 | 26635501691 |
# - - - - Question(1) - 2nd step: Create a plot for movies financial performance grouped by years:
palette1 = ["#8BC1C2","#91B9A3"] # colour palette(1)
fig_budget_revenue = px.area(financial_performance_df, x='release_year', y=['budget','revenue'], color_discrete_sequence = palette1,
title="The Financial Performance of Movies Over Time", labels={'release_year':'year', 'value':'Thousands of millions ($)', 'variable':'Financial Performance'} )\
.update_traces(dict(marker_line_width=1, marker_line_color="black"))
fig_budget_revenue.update_layout({
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})
fig_budget_revenue.show()
Clearly, we can see that over the years both the movie industry expenses and the revenues have been increasing exponentially, reaching the highest point in 2015 with $26 billion USD. which indicates the increase of the investors' interest in the industry due to its high profitability.
# - - - - Question(2) - 1st step: Calculate the profit value for each movie and determine if it's profitable or not
df_url['profit_value'] = df_url['revenue'] - df_url['budget']
df_url['profitable'] = np.where(df_url['revenue'] > df_url['budget'], 1, 0) #determine whether the movie is (profitable = 1) or (not = 0)
df_url.head(2)
| id | imdb_id | popularity | budget | revenue | original_title | cast | director | keywords | runtime | ... | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | release_day | release_month | profit_value | profitable | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | Colin Trevorrow | monster|dna|tyrannosaurus rex|velociraptor|island | 124 | ... | 2015-06-09 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 | 9 | 6 | 1363528810 | 1 |
| 1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | George Miller | future|chase|post-apocalyptic|dystopia|australia | 120 | ... | 2015-05-13 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 | 13 | 5 | 228436354 | 1 |
2 rows × 22 columns
# - - - - Question(2) - 2nd step: Select the top 10 movies by most rating to analys thier financial performence
top_rating_movies = df_url.sort_values('vote_count',ascending=False).head(10)
top_rating_movies
| id | imdb_id | popularity | budget | revenue | original_title | cast | director | keywords | runtime | ... | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | release_day | release_month | profit_value | profitable | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1919 | 27205 | tt1375666 | 9.363643 | 160000000 | 825500000 | Inception | Leonardo DiCaprio|Joseph Gordon-Levitt|Ellen P... | Christopher Nolan | loss of lover|dream|sleep|subconsciousness|heist | 148 | ... | 2010-07-14 | 9767 | 7.9 | 2010 | 1.600000e+08 | 8.255000e+08 | 14 | 7 | 665500000 | 1 |
| 4360 | 24428 | tt0848228 | 7.637767 | 220000000 | 1519557910 | The Avengers | Robert Downey Jr.|Chris Evans|Mark Ruffalo|Chr... | Joss Whedon | new york|shield|marvel comic|comic|superhero | 143 | ... | 2012-04-25 | 8903 | 7.3 | 2012 | 2.089437e+08 | 1.443191e+09 | 25 | 4 | 1299557910 | 1 |
| 1386 | 19995 | tt0499549 | 9.432768 | 237000000 | 2781505847 | Avatar | Sam Worthington|Zoe Saldana|Sigourney Weaver|S... | James Cameron | culture clash|future|space war|space colony|so... | 162 | ... | 2009-12-10 | 8458 | 7.1 | 2009 | 2.408869e+08 | 2.827124e+09 | 10 | 12 | 2544505847 | 1 |
| 2874 | 155 | tt0468569 | 8.466668 | 185000000 | 1001921825 | The Dark Knight | Christian Bale|Michael Caine|Heath Ledger|Aaro... | Christopher Nolan | dc comics|crime fighter|secret identity|scarec... | 152 | ... | 2008-07-16 | 8432 | 8.1 | 2008 | 1.873655e+08 | 1.014733e+09 | 16 | 7 | 816921825 | 1 |
| 4363 | 68718 | tt1853728 | 5.944518 | 100000000 | 425368238 | Django Unchained | Jamie Foxx|Christoph Waltz|Leonardo DiCaprio|K... | Quentin Tarantino | bounty hunter|hero|plantation|society|friendship | 165 | ... | 2012-12-25 | 7375 | 7.7 | 2012 | 9.497443e+07 | 4.039911e+08 | 25 | 12 | 325368238 | 1 |
| 4381 | 70160 | tt1392170 | 2.571099 | 75000000 | 691210692 | The Hunger Games | Jennifer Lawrence|Josh Hutcherson|Liam Hemswor... | Gary Ross | hallucination|dystopia|female protagonist|bow ... | 142 | ... | 2012-03-12 | 7080 | 6.7 | 2012 | 7.123082e+07 | 6.564734e+08 | 12 | 3 | 616210692 | 1 |
| 5424 | 68721 | tt1300854 | 4.946136 | 200000000 | 1215439994 | Iron Man 3 | Robert Downey Jr.|Gwyneth Paltrow|Guy Pearce|D... | Shane Black | terrorist|war on terror|tennessee|malibu|marve... | 130 | ... | 2013-04-18 | 6882 | 6.9 | 2013 | 1.872067e+08 | 1.137692e+09 | 18 | 4 | 1015439994 | 1 |
| 4362 | 49026 | tt1345836 | 6.591277 | 250000000 | 1081041287 | The Dark Knight Rises | Christian Bale|Michael Caine|Gary Oldman|Anne ... | Christopher Nolan | dc comics|crime fighter|terrorist|secret ident... | 165 | ... | 2012-07-16 | 6723 | 7.5 | 2012 | 2.374361e+08 | 1.026713e+09 | 16 | 7 | 831041287 | 1 |
| 629 | 157336 | tt0816692 | 24.949134 | 165000000 | 621752480 | Interstellar | Matthew McConaughey|Jessica Chastain|Anne Hath... | Christopher Nolan | saving the world|artificial intelligence|fathe... | 169 | ... | 2014-11-05 | 6498 | 8.0 | 2014 | 1.519800e+08 | 5.726906e+08 | 5 | 11 | 456752480 | 1 |
| 4366 | 49051 | tt0903624 | 4.218933 | 250000000 | 1017003568 | The Hobbit: An Unexpected Journey | Ian McKellen|Martin Freeman|Richard Armitage|A... | Peter Jackson | riddle|elves|dwarves|orcs|middle-earth (tolkien) | 169 | ... | 2012-11-26 | 6417 | 6.9 | 2012 | 2.374361e+08 | 9.658933e+08 | 26 | 11 | 767003568 | 1 |
10 rows × 22 columns
# - - - - Question(2) - 3nd step: financial performence of the top 10 most rating movie
top_rating_financial_fig = px.bar(top_rating_movies, x=['budget','profit_value',], y='original_title',
labels={'original_title':'Movie', 'budget': 'Budget', 'profit_value':'Profit', 'variable':'Financial Performance', 'value':'Thousands of millions ($)' },
text_auto='.3s',
color_discrete_sequence=palette1 ,
title="Financial Performence of The Top 10 Most Rating Movies")
top_rating_financial_fig.update_layout({
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})
top_rating_financial_fig.show()
Avatar was the most commercially successful movie of all time, earning 2.54 billion U.S. dollars at the box office. It is evident that the most successful movies of all time with the highest vote are the ones with the greatest profitability.
df_url['genres'].value_counts() # print unique values of genres
Drama 621
Comedy 584
Drama|Romance 264
Comedy|Drama 253
Documentary 243
...
Documentary|Drama|Music|Thriller 1
Comedy|Horror|Thriller|Mystery|Fantasy 1
Comedy|Family|Mystery|TV Movie 1
Mystery|Drama|Adventure 1
Mystery|Comedy 1
Name: genres, Length: 1870, dtype: int64
# - - - - Question(3) - 1st step: Sum of budgets for each genres (Grouped by)
budget_genres = df_url.groupby('genres')['budget'].sum().reset_index(name = "budget_genres")
budget_genres = budget_genres.sort_values('budget_genres',ascending=False).head(10) # get the 10 highest budgets of gebres
budget_genres
| genres | budget_genres | |
|---|---|---|
| 640 | Comedy | 6547735670 |
| 915 | Drama | 5621027156 |
| 788 | Comedy|Romance | 3269566906 |
| 1115 | Drama|Romance | 2611808060 |
| 715 | Comedy|Drama|Romance | 2126509807 |
| 682 | Comedy|Drama | 1965552010 |
| 304 | Action|Thriller | 1961700000 |
| 601 | Animation|Family | 1822500000 |
| 483 | Adventure|Fantasy|Action | 1776900000 |
| 104 | Action|Adventure|Science Fiction | 1594500001 |
# - - - - Question(2) - 2nd step: bar plot
palette2 = ["#F4B592","#D2BAC8","#E5BDB2","#D6C0B6","#CBDBCE","#F5A7A0","#8BC1C2","#91B9A3"]
top_budget_genres_fig = px.bar(budget_genres, x='budget_genres', y='genres',
labels={'genres':'Genres', 'budget_genres': 'Budget', 'value':'Thousands of millions ($)'},
text_auto='.3s',
color_discrete_sequence = palette2 ,
title="Movies Genres with highest investments")
top_budget_genres_fig.update_layout({
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})
top_budget_genres_fig.show()
The plot above revealed that comedy was the highest investments genre with a budget of 6.5 billion U.S. dollars, though Drama, Comedy|Romance, and Drama|Romance were similarly high in investments.
# - - - - Question(4) - 1st step: Sum of profits for each genres (Grouped by)
profit_genres = df_url.groupby('genres')['profit_value'].sum().reset_index(name = "profit_genres")
profit_genres = profit_genres.sort_values('profit_genres',ascending=False).head(10) # get the 10 highest budgets of gebres
profit_genres
| genres | profit_genres | |
|---|---|---|
| 640 | Comedy | 13237560812 |
| 915 | Drama | 9302298294 |
| 788 | Comedy|Romance | 8061551095 |
| 483 | Adventure|Fantasy|Action | 5820583556 |
| 601 | Animation|Family | 4874729242 |
| 104 | Action|Adventure|Science Fiction | 4825984307 |
| 515 | Adventure|Fantasy|Family | 4799516484 |
| 715 | Comedy|Drama|Romance | 4708543340 |
| 364 | Adventure|Action|Science Fiction | 4362926988 |
| 1115 | Drama|Romance | 4165058837 |
# - - - - Question(2) - 2nd step: bar plot
palette3 = ["#51C3C5","#91B9A3","#51C3C5","#F1B2A0","#8B575C","#C98986","#F6BDD1","#F6E4F6","#857991"]
top_profit_genres_fig = px.bar(profit_genres, x='profit_genres', y='genres',
labels={'genres':'Genres', 'profit_genres': 'Profit', 'value':'Thousands of millions ($)'},
text_auto='.3s',
color_discrete_sequence = palette3 ,
title="Movies Genres with highest Profits")
top_profit_genres_fig.update_layout({
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})
top_profit_genres_fig.show()
Between 1960 and 2015, Comedy was the highest profitable movie genre at the box office. Comedy films recorded a box office revenue of almost 13.2 billion U.S. dollars within that period. Drama movies followed with more than 9 billion dollars in revenue.
profit_genres1 = df_url.groupby('genres')['profit_value'].sum().reset_index(name = "profit_genres")
profit_genres1 = profit_genres.sort_values('profit_genres',ascending=True).head(10) # get the 10 highest budgets of gebres
profit_genres1
| genres | profit_genres | |
|---|---|---|
| 1115 | Drama|Romance | 4165058837 |
| 364 | Adventure|Action|Science Fiction | 4362926988 |
| 715 | Comedy|Drama|Romance | 4708543340 |
| 515 | Adventure|Fantasy|Family | 4799516484 |
| 104 | Action|Adventure|Science Fiction | 4825984307 |
| 601 | Animation|Family | 4874729242 |
| 483 | Adventure|Fantasy|Action | 5820583556 |
| 788 | Comedy|Romance | 8061551095 |
| 915 | Drama | 9302298294 |
| 640 | Comedy | 13237560812 |
import seaborn as sns
pallete1 = ["#F5A7A0","#91B9A3"]
fig, ax = plt.subplots(figsize=(16, 8))
voting_profit_fig =sns.scatterplot(x="revenue", y="vote_count",
hue="profitable", palette=pallete1, legend='full',
data=df_url);
It can be seen from the graph that movies with a higher revenue also have a higher vote count. 1 means it's protifable and 0 mean its not.
Although investment in the movie industry can actually be a complex and risky endeavor, it is one of the most reliable industry in getting profits, And according to this experimental investigation, we find that it’s definitely a hugely profitable business.
In this report, we investigated the relationship between box office revenue(financial statement) and different variables, including vote counting, and genre, and how it can influence the financial success of a film, but there might be different major factors including casts, date of release, keywords, and popularity have statistically significant in generating box office revenue.
Among the factors that determine movie attendance, genre was found to be the most significant. we examined after analys our data that choosing a popular genre would be positively correlated with high box office revenue. The most profitable movies consists of 10 most popular genres, including Comedy, Drama, Comedy|Romance, Adventure|Fantasy|Action, Animation|Family, Action|Adventure|Science Fiction, Comedy|Drama|Romance, Adventure|Action|Science Fiction, and Drama|Romance, while the remaining genres such as Fantasy, Family, Mystery, Documentary, and Horror are classified as ‘unpopular’.
Also looking to the genres of the most profitable movies we can observe that there are different genre classifications assigned to the same movie in an attempt to attract a broad audience in order to maximize box office revenue.
Additionally, most of the high vote-count movies were found to experience significant increases in revenue, which confirms our hypothesis. Therefore, we can conclude that the vote count of a movie does, in fact, have a significant impact on domestic box office revenue.
Finally, A profit value was calculated as a better indicator of a movie's financial success
One of the limitations that I challenged, There was some difficulty in obtaining data about revenue and budget costs of movies that were associated with the high vote-average, when trying to analyze the correlation between the vote-average and the box office revenues, which has to be ignored in this study, despite the fact that voting average plays a statistically significant role in increasing box office revenue.
However, the alternative approach was to use the vote-count to determine the financial success of a movie.